Wilcoxon Signed-Rank Test (paired, nonparametric)#
The Wilcoxon signed-rank test answers a common question:
When you measure the same subjects twice (before/after, left/right, pre/post), is there evidence of a systematic shift?
It is a nonparametric alternative to the paired t-test.
It works on the paired differences and uses ranks (so it’s more robust to outliers than a mean-based test).
It tests a location shift of the differences (often described as a test about the median difference, under a symmetry assumption).
Learning goals#
By the end you should be able to:
state the hypotheses and when the test is appropriate
compute the signed-rank statistic by hand (conceptually)
implement the test from scratch using only NumPy
visualize the test statistic under the null (exactly and via simulation)
interpret the p-value, direction, and an effect size
Prerequisites#
paired data / dependent samples
basic hypothesis testing (null/alternative, p-values)
familiarity with ranks (sorting, ties)
import math
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=6, suppress=True)
rng = np.random.default_rng(42)
1) What the test is (and when to use it)#
Setting#
You have paired observations ((x_i, y_i)) for (i=1,\dots,n).
Examples:
the same user’s latency before vs after an optimization
blood pressure pre vs post treatment
left vs right measurement on the same subject
Define paired differences:
[ d_i = x_i - y_i ]
Hypotheses (typical)#
Two-sided: (H_0): the distribution of (d) is symmetric around 0 (no location shift) vs (H_1): not symmetric around 0.
One-sided: (H_0) vs (H_1: \text{location}(d) > 0) or (H_1: \text{location}(d) < 0).
Under the symmetry assumption, the test is commonly described as:
[ H_0: \text{median}(d) = 0 ]
When it’s a good fit#
Use Wilcoxon signed-rank when:
samples are paired / dependent
differences are at least ordinal (you can rank magnitudes)
you don’t want to rely on normality of differences (paired t-test assumption)
When it’s not the right tool#
if the samples are independent (use Mann–Whitney U / Wilcoxon rank-sum instead)
if differences are wildly asymmetric (the test targets a “shift” under symmetry; consider permutation tests or other robust approaches)
if you have many exact zeros (the sign test might be a better match)
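For comparison, the sign test mentioned above is easy to sketch: drop zeros, count positives, and use an exact binomial tail. A minimal illustration (function name and data are made up for this example):

```python
import math

import numpy as np


def sign_test_p_greater(d):
    """Exact one-sided sign test: P(#positives >= observed) under Binomial(m, 1/2).

    Zeros are dropped, matching the convention used for the signed-rank test.
    """
    d = np.asarray(d, dtype=float)
    d = d[d != 0]
    m = d.size
    k = int((d > 0).sum())
    # Exact binomial upper tail: sum_{j=k}^{m} C(m, j) / 2^m
    return sum(math.comb(m, j) for j in range(k, m + 1)) / 2**m


# Example: 7 positives out of 8 non-zero differences -> p = 9/256 ≈ 0.035
p = sign_test_p_greater([1.2, 0.5, 3.0, -0.1, 0.0, 0.8, 2.2, 0.4, 1.1])
```

Note that the sign test ignores magnitudes entirely, which is exactly why it copes better with many zeros but has less power than the signed-rank test when magnitudes are informative.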
# Synthetic paired data: a small location shift with heavy-tailed noise
n = 22
before = rng.normal(50, 10, size=n)
# Heavy-tailed differences (outliers are plausible)
diff = rng.standard_t(df=3, size=n) * 2.0 + 2.0
# Add a couple of exact zeros to demonstrate the common convention (drop zeros)
diff[[3, 15]] = 0.0
after = before + diff
d = after - before
n, d[:8]
(22,
array([ 4.983304, 1.364337, 2.776162, 0. , 2.892147, -0.95964 ,
3.335939, 2.499282]))
# Paired plot: each line is one subject
x = np.ravel(np.column_stack([np.zeros(n), np.ones(n), np.full(n, np.nan)]))
y = np.ravel(np.column_stack([before, after, np.full(n, np.nan)]))
fig = go.Figure(
go.Scatter(
x=x,
y=y,
mode="lines+markers",
line=dict(color="rgba(31, 119, 180, 0.35)", width=2),
marker=dict(size=7, color="rgba(31, 119, 180, 0.85)"),
showlegend=False,
)
)
fig.update_layout(
title="Paired measurements (before → after)",
xaxis=dict(tickmode="array", tickvals=[0, 1], ticktext=["before", "after"], range=[-0.15, 1.15]),
yaxis_title="value",
)
fig.show()
Intuition: we care about the direction and magnitude of differences#
The signed-rank test keeps the sign of each difference (positive/negative) and uses the rank of its magnitude.
A plain sign test uses only the signs.
A paired t-test uses magnitudes but through the mean (more sensitive to outliers).
Wilcoxon signed-rank sits in between: it’s rank-based (robust) but still rewards consistently large differences.
# Distribution of paired differences
fig = px.histogram(
x=d,
nbins=16,
title="Paired differences (after - before)",
labels={"x": "difference"},
)
fig.add_vline(x=float(np.median(d)), line_dash="dash", line_color="black", annotation_text="median")
fig.show()
fig = px.violin(
y=d,
box=True,
points="all",
title="Differences with individual points",
labels={"y": "difference"},
)
fig.add_hline(y=0, line_dash="dot", line_color="gray")
fig.show()
2) The signed-rank statistic (how it’s computed)#
Given differences (d_1,\dots,d_n):
Drop zeros: keep only (d_i \neq 0). Let (m) be the number of remaining differences.
Compute (|d_i|) and rank them from smallest to largest.
If there are ties in (|d_i|), assign average ranks.
Sum ranks for positive differences:
[ W^+ = \sum_{i: d_i > 0} \mathrm{rank}(|d_i|) ]
Sum ranks for negative differences:
[ W^- = \sum_{i: d_i < 0} \mathrm{rank}(|d_i|) ]
For a two-sided test, a common statistic is:
[ W = \min(W^+, W^-) ]
If there’s no shift, positives and negatives should be “balanced”, so (W^+) and (W^-) should be similar.
If the true shift is positive, (W^+) tends to be large (and (W^-) small).
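Before automating it, it may help to trace the recipe on a tiny hand-made example (numbers chosen purely for illustration):

```python
import numpy as np

# Hand-sized example with five differences and no ties
d = np.array([1.3, -0.4, 2.1, -0.2, 0.9])

# |d| = [1.3, 0.4, 2.1, 0.2, 0.9] -> ranks (smallest = 1): [4, 2, 5, 1, 3]
abs_d = np.abs(d)
ranks = abs_d.argsort().argsort() + 1  # double argsort works here because there are no ties

w_plus = ranks[d > 0].sum()   # ranks of +1.3, +2.1, +0.9 -> 4 + 5 + 3 = 12
w_minus = ranks[d < 0].sum()  # ranks of -0.4, -0.2 -> 2 + 1 = 3
# Sanity check: W+ + W- always equals the total rank sum m(m+1)/2 = 15
```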
What makes it “exact” under (H_0)?#
Under (H_0), each non-zero difference is equally likely to be positive or negative (a random sign).
So the null distribution of (W^+) is the distribution of a random subset sum of the ranks.
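This "random subset sum" description can be checked by brute force for small (m): enumerate every equally likely sign pattern and tabulate (W^+). A quick self-contained sketch, separate from the main implementation below:

```python
from itertools import product

import numpy as np

m = 4
ranks = np.arange(1, m + 1)  # untied ranks 1..m

# All 2^m sign patterns are equally likely under H0; record W+ for each
w_values = [int(ranks[np.array(signs) > 0].sum()) for signs in product([-1, 1], repeat=m)]

# Tabulate: P(W+ = w) = count / 2^m, support is 0..m(m+1)/2 = 10
values, counts = np.unique(w_values, return_counts=True)
pmf = counts / 2**m
```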
def rankdata_average(a: np.ndarray) -> np.ndarray:
'Rank data with average ranks for ties (like scipy.stats.rankdata(method="average")).'
a = np.asarray(a)
if a.ndim != 1:
raise ValueError("rankdata_average expects a 1D array")
n = a.size
if n == 0:
return a.astype(float)
sorter = np.argsort(a, kind="mergesort") # stable
a_sorted = a[sorter]
# Tie blocks in sorted order
_, first_idx, counts = np.unique(a_sorted, return_index=True, return_counts=True)
ranks_sorted = np.empty(n, dtype=float)
for start, count in zip(first_idx, counts):
end = start + count
# 1-based ranks: start_rank = start+1, end_rank = end
avg_rank = 0.5 * ((start + 1) + end)
ranks_sorted[start:end] = avg_rank
ranks = np.empty(n, dtype=float)
ranks[sorter] = ranks_sorted
return ranks
def signed_rank_components(d: np.ndarray):
'Compute ranks and W+/W- for differences d, dropping zeros.'
d = np.asarray(d, dtype=float)
d = d[np.isfinite(d)]
d_nz = d[d != 0]
abs_d = np.abs(d_nz)
ranks = rankdata_average(abs_d)
w_plus = float(ranks[d_nz > 0].sum())
w_minus = float(ranks[d_nz < 0].sum())
return {
"d": d_nz,
"abs_d": abs_d,
"ranks": ranks,
"w_plus": w_plus,
"w_minus": w_minus,
"m": int(d_nz.size),
}
parts = signed_rank_components(d)
parts["m"], parts["w_plus"], parts["w_minus"]
(20, 180.0, 30.0)
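Tie handling deserves a quick sanity check. Here is a self-contained mini version of average ranking on an array with a three-way tie; rankdata_average above gives the same answer:

```python
import numpy as np

a = np.array([3.0, 1.0, 3.0, 2.0, 3.0])

# Provisional 1-based ranks with ties broken by position
order = np.argsort(a, kind="mergesort")
provisional = np.empty(a.size)
provisional[order] = np.arange(1, a.size + 1)

# Replace each tie group's provisional ranks with their mean:
# the three 3.0s occupy rank slots 3, 4, 5 -> each gets (3 + 4 + 5) / 3 = 4.0
ranks = np.array([provisional[a == v].mean() for v in a])
# ranks -> [4.0, 1.0, 4.0, 2.0, 4.0]
```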
# Visual: each bar is one |difference| rank, colored by sign
ranks = parts["ranks"]
d_nz = parts["d"]
order = np.argsort(parts["abs_d"])
colors = np.where(d_nz[order] > 0, "#2ca02c", "#d62728")
labels = [f"#{i+1}" for i in range(parts["m"])]
fig = go.Figure(
go.Bar(
x=labels,
y=ranks[order],
marker_color=colors,
hovertext=[
f"d={d_nz[j]:.3f}<br>|d|={abs(d_nz[j]):.3f}<br>rank={ranks[j]:.1f}"
for j in order
],
hoverinfo="text",
)
)
fig.update_layout(
title="Signed-rank building blocks (green = positive difference, red = negative)",
xaxis_title="differences sorted by |d|",
yaxis_title="rank(|d|)",
)
fig.show()
3) A NumPy-only implementation (including p-values)#
We’ll implement:
average ranks for ties
the signed-rank sums (W^+), (W^-)
exact p-values when there are no ties (small/medium (m)) via subset-sum DP
a normal approximation fallback (with optional continuity correction)
This mirrors the logic used in standard statistical libraries, but keeps the steps explicit.
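The normal approximation uses the exact null moments of (W^+). Writing (W^+ = \sum_i r_i I_i), where (r_i) are the ranks of (|d_i|) and the (I_i) are independent (\mathrm{Bernoulli}(1/2)) sign indicators under (H_0):

[ \mathbb{E}[W^+] = \frac{1}{2}\sum_i r_i, \qquad \mathrm{Var}(W^+) = \frac{1}{4}\sum_i r_i^2 ]

For untied ranks (1,\dots,m) these reduce to the textbook formulas (m(m+1)/4) and (m(m+1)(2m+1)/24); the rank-based forms handle ties automatically, which is why the code uses them.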
def _normal_cdf(z: float) -> float:
return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
def wilcoxon_exact_pmf_no_ties(m: int) -> np.ndarray:
'PMF of W+ under H0 when ranks are exactly 1..m (no ties, no zeros).'
if m < 0:
raise ValueError("m must be non-negative")
total = m * (m + 1) // 2
counts = np.zeros(total + 1, dtype=np.int64)
counts[0] = 1
# Subset-sum DP: counts[s] = number of subsets of {1..m} that sum to s
for r in range(1, m + 1):
counts[r:] += counts[:-r]
denom = 2 ** m
return counts.astype(float) / float(denom)
def wilcoxon_signed_rank(
x: np.ndarray,
y: np.ndarray | None = None,
*,
mu0: float = 0.0,
alternative: str = "two-sided",
method: str = "auto", # auto | exact | normal
correction: bool = True,
):
'Wilcoxon signed-rank test, implemented with NumPy (educational).'
x = np.asarray(x, dtype=float)
if y is None:
d = x - mu0
else:
y = np.asarray(y, dtype=float)
if x.shape != y.shape:
raise ValueError("x and y must have the same shape")
d = (x - y) - mu0
parts = signed_rank_components(d)
m = parts["m"]
if m == 0:
raise ValueError("All differences are zero (after dropping zeros)")
d_nz = parts["d"]
ranks = parts["ranks"]
w_plus = parts["w_plus"]
w_minus = parts["w_minus"]
total_rank_sum = float(ranks.sum())
has_ties = np.unique(parts["abs_d"]).size != m
if alternative not in {"two-sided", "greater", "less"}:
raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
if method == "auto":
method_ = "exact" if (not has_ties and m <= 30) else "normal"
else:
method_ = method
if method_ == "exact":
if has_ties:
raise ValueError("Exact p-values require no ties in |d|. Use method='normal' or method='auto'.")
probs = wilcoxon_exact_pmf_no_ties(m)
w_plus_int = int(round(w_plus))
cdf = float(probs[: w_plus_int + 1].sum())
sf = float(probs[w_plus_int:].sum())
if alternative == "greater":
p_value = sf
elif alternative == "less":
p_value = cdf
else:
p_value = min(1.0, 2.0 * min(cdf, sf))
z = None
elif method_ == "normal":
mu = 0.5 * total_rank_sum
sigma2 = 0.25 * float(np.sum(ranks**2))
sigma = math.sqrt(sigma2)
w_cc = w_plus
if correction:
delta = w_plus - mu
if delta > 0:
w_cc = w_plus - 0.5
elif delta < 0:
w_cc = w_plus + 0.5
z = (w_cc - mu) / sigma
if alternative == "greater":
p_value = 1.0 - _normal_cdf(z)
elif alternative == "less":
p_value = _normal_cdf(z)
else:
p_value = 2.0 * min(_normal_cdf(z), 1.0 - _normal_cdf(z))
else:
raise ValueError("method must be 'auto', 'exact', or 'normal'")
rank_biserial = (w_plus - w_minus) / total_rank_sum # in [-1, 1]
median_diff = float(np.median(d_nz))
return {
"m": m,
"w_plus": float(w_plus),
"w_minus": float(w_minus),
"statistic": float(min(w_plus, w_minus)) if alternative == "two-sided" else float(w_plus),
"p_value": float(p_value),
"alternative": alternative,
"method": method_,
"median_diff": median_diff,
"rank_biserial": float(rank_biserial),
"z": None if z is None else float(z),
}
res = wilcoxon_signed_rank(after, before, alternative="greater", method="auto")
res
{'m': 20,
'w_plus': 180.0,
'w_minus': 30.0,
'statistic': 180.0,
'p_value': 0.001827239990234375,
'alternative': 'greater',
'method': 'exact',
'median_diff': 1.7795950541624563,
'rank_biserial': 0.7142857142857143,
'z': None}
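The exact PMF can be sanity-checked against properties we can prove: it must sum to 1, and it must be symmetric about (m(m+1)/4), since complementing a subset maps a sum (s) to (	ext{total} - s). A self-contained re-run of the same subset-sum DP for m = 20:

```python
import numpy as np

# Same subset-sum DP as wilcoxon_exact_pmf_no_ties, inlined for m = 20
m = 20
total = m * (m + 1) // 2                 # maximum possible W+ = 210
counts = np.zeros(total + 1, dtype=np.int64)
counts[0] = 1
for r in range(1, m + 1):
    counts[r:] += counts[:-r]            # subsets of {1..m} attaining each sum
probs = counts / 2.0**m

assert np.isclose(probs.sum(), 1.0)      # it is a probability distribution
assert np.allclose(probs, probs[::-1])   # symmetric around m(m+1)/4 = 105

p_greater = probs[180:].sum()            # P(W+ >= 180): the one-sided p-value above
```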
# Practical usage: compare with SciPy
from scipy import stats
scipy_res = stats.wilcoxon(after, before, alternative="greater", zero_method="wilcox", method="auto")
scipy_res
WilcoxonResult(statistic=180.0, pvalue=0.0025555243609691396)
The p-values differ slightly: because the data contain exact zeros, SciPy's method="auto" falls back to a normal approximation (by default without continuity correction), while our implementation drops the zeros first and then uses the exact null distribution.
4) Visualizing the null distribution and the p-value#
For no ties and moderate (m), we can compute the exact null distribution of (W^+).
Below, we plot the PMF of (W^+) under (H_0) and highlight the observed tail probability.
# Exact null distribution (only valid when there are no ties in |d|)
parts = signed_rank_components(d)
m = parts["m"]
has_ties = np.unique(parts["abs_d"]).size != m
if has_ties:
print("This dataset has ties in |d|, so an exact PMF is not shown.")
else:
probs = wilcoxon_exact_pmf_no_ties(m)
sums = np.arange(probs.size)
w_plus = int(round(parts["w_plus"]))
tail_mask = sums >= w_plus
tail_probs = np.where(tail_mask, probs, 0.0)
fig = go.Figure()
fig.add_trace(go.Bar(x=sums, y=probs, name="PMF", marker_color="rgba(31, 119, 180, 0.35)"))
fig.add_trace(go.Bar(x=sums, y=tail_probs, name="tail", marker_color="rgba(214, 39, 40, 0.65)"))
fig.add_vline(x=w_plus, line_dash="dash", line_color="black", annotation_text=f"W+ = {w_plus}")
p_tail = float(probs[tail_mask].sum())
fig.update_layout(
title=f"Exact null distribution of W+ (m={m}) — one-sided tail p={p_tail:.4f}",
xaxis_title="W+",
yaxis_title="P(W+ = w)",
barmode="overlay",
)
fig.show()
# Monte Carlo view: random sign-flips on the observed ranks (works with ties)
parts = signed_rank_components(d)
ranks = parts["ranks"]
w_plus_obs = parts["w_plus"]
n_sims = 50_000
signs = rng.choice([-1, 1], size=(n_sims, ranks.size))
w_plus_sim = (ranks * (signs > 0)).sum(axis=1)
fig = px.histogram(
w_plus_sim,
nbins=40,
title="Null distribution by sign-flip simulation (Monte Carlo)",
labels={"value": "W+"},
)
fig.add_vline(x=w_plus_obs, line_dash="dash", line_color="black", annotation_text="observed W+")
fig.show()
p_mc_greater = float(np.mean(w_plus_sim >= w_plus_obs))
print(f"Monte Carlo one-sided p (greater): {p_mc_greater:.4f}")
Monte Carlo one-sided p (greater): 0.0017
5) Interpreting the result#
What the p-value means#
A p-value is always a statement assuming the null hypothesis is true.
For Wilcoxon signed-rank (paired):
(H_0) says there is no systematic shift in the paired differences (symmetry around 0).
The p-value is the probability of seeing a signed-rank sum at least as extreme as observed if signs were effectively random.
Direction and “how big is it?”#
The test answers “is there evidence of a shift?”, not “how large is the shift?”.
To communicate size + direction, report alongside the p-value:
median difference (robust location summary)
an effect size like rank-biserial correlation:
[ \text{RBC} = \frac{W^+ - W^-}{W^+ + W^-} \in [-1, 1] ]
Interpretation (rule of thumb):
RBC near +1: most rank mass is on the positive side (x > y)
RBC near -1: most rank mass is on the negative side (x < y)
RBC near 0: balanced positives and negatives
res = wilcoxon_signed_rank(after, before, alternative="greater", method="auto")
alpha = 0.05
print(f"m (non-zero diffs): {res['m']}")
print(f"W+ = {res['w_plus']:.3f}, W- = {res['w_minus']:.3f}")
print(f"p-value ({res['method']}, {res['alternative']}): {res['p_value']:.6f}")
print(f"median(after - before) = {res['median_diff']:.3f}")
print(f"rank-biserial correlation = {res['rank_biserial']:.3f}")
if res["p_value"] < alpha:
    print(f"\nDecision @ α={alpha}: reject H0 (evidence of a positive shift).")
else:
    print(f"\nDecision @ α={alpha}: fail to reject H0 (insufficient evidence of a positive shift).")
m (non-zero diffs): 20
W+ = 180.000, W- = 30.000
p-value (exact, greater): 0.001827
median(after - before) = 1.780
rank-biserial correlation = 0.714

Decision @ α=0.05: reject H0 (evidence of a positive shift).
6) Pitfalls and diagnostics#
Pairedness matters: Wilcoxon signed-rank is for paired samples. If your samples are independent, use a different test.
Symmetry of differences: the signed-rank test is designed for a location shift under (approximate) symmetry of (d).
If (d) is highly skewed, consider a permutation test on a robust statistic or model the differences directly.
Zeros and ties:
Many exact zeros weaken the test (less information). Be explicit about the convention (drop zeros vs other treatments).
Many ties reduce the effective resolution; exact p-values become trickier.
Statistical vs practical significance: a tiny shift can be “significant” with large (n); always report an effect size.
7) Exercises#
Simulate paired data where the paired t-test rejects but Wilcoxon does not (and vice versa). What changed?
Construct a dataset with many zeros. Compare:
Wilcoxon signed-rank (drop zeros)
sign test (count positives vs negatives)
Implement a two-sided exact p-value visualization (both tails) and compare it to SciPy’s
alternative='two-sided'.
References#
SciPy docs: scipy.stats.wilcoxon
Nonparametric statistics textbooks / lecture notes on signed-rank tests